Bacterial composition in Swedish raw drinking water reveals three major interacting ubiquitous metacommunities

Abstract Background Surface raw water used as a source for drinking water production is a critical resource, sensitive to contamination. We conducted a study on Swedish raw water sources, aiming to identify mutually co‐occurring metacommunities of bacteria, and environmental factors driving such patterns. Methods The water sources were different regarding nutrient composition, water quality, and climate characteristics, and displayed various degrees of anthropogenic impact. Water inlet samples were collected at six drinking water treatment plants over 3 years, totaling 230 samples. The bacterial communities of DNA sequenced samples (n = 175), obtained by 16S metabarcoding, were analyzed using a joint model for taxa abundance. Results Two major groups of well‐defined metacommunities of microorganisms were identified, in addition to a third, less distinct, and taxonomically more diverse group. These three metacommunities showed various associations to the measured environmental data. Predictions for the well‐defined metacommunities revealed differing sets of favored metabolic pathways and life strategies. In one community, taxa with methanogenic metabolism were common, while a second community was dominated by taxa with carbohydrate and lipid‐focused metabolism. Conclusion The identification of ubiquitous persistent co‐occurring bacterial metacommunities in freshwater habitats could potentially facilitate microbial source tracking analysis of contamination issues in freshwater sources.


| INTRODUCTION
Access to clean water is of global importance to public health and a key factor in maintaining a well-functioning society. Future challenges-arising due to increased urbanization and climate change-are expected to reduce freshwater quality, resulting in increased particle and nutrient load, but also fecal pollution (Arnell et al., 2015;Howard et al., 2016;Vörösmarty et al., 2000). Due to better, more readily available sequencing capabilities at a lower cost, several recent studies have been able to assess the anthropogenic impact on bacterial composition in watersheds Llirós et al., 2014;Shen et al., 2019). These studies indicate that anthropogenic actions have a clear impact on bacterial diversity along eutrophicoligotrophic gradients. Eutrophication can disturb the microbial community composition, altering the carbon and nutrient cycling and, as a result, the entire aquatic ecosystem (Kiersztyn et al., 2019;Nyirabuhoro et al., 2020;Zeng et al., 2019). Regarding the impact of anthropogenic activity on microbial diversity and function, few long-term longitudinal studies have been conducted, emphasizing the need for increased knowledge of seasonal and interannual changes in biodiversity at the community level. Due to the high turnover rate of most prokaryotes, as compared to larger organisms, long-term trends in microbial communities are of particular interest, as these communities have the potential to change more over time, thus resulting in a faster response to anthropogenically induced perturbations.
Most studies of bacterial community composition in boreal lakes, which are of specific importance as sources of drinking water in temperate regions, have focused on inferring factors shaping bacterial community structure and possible correlations within the communities. These studies have described waters in the Nordic countries (Eiler & Bertilsson, 2007;Eiler et al., 2012Eiler et al., , 2013Peura et al., 2012) and in other similar environments, such as freshwater bog lakes in the northern United States (Linz et al., 2017) and boreal lakes in Québec, Canada (Cheaib et al., 2018;Niño-García et al., 2017) as well as lakes and ponds across Europe (Bock et al., 2020). Eiler et al. (2012) examined the temporal dynamics of bacterioplankton communities in Lake Erken situated in eastern Sweden and found temporal trajectories over annual cycles and complex inter-dependencies within communities which point toward the importance of biotic interactions (such as direct competition/mutualism as well as less direct interaction) for shaping community structure.
However, as pointed out by Langenheder and Lindström (2019), a limitation of the aforementioned studies is that they have either focused on a long time series for a single lake or single/few time points across many lakes. Therefore, further studies of longitudinal data collected in multiple lakes are warranted to understand the complex associations of diversity, interactions within and between communities, and the influence of environmental and anthropogenic factors, to understand the general governing principles of microbial composition and ecology. From a water safety management point of view, large-scale longitudinal studies are required to better define the variability in the community as true perturbations due to external factors will be more easily identifiable and discernible from natural fluctuations by employing the results of such studies. Discriminating natural fluctuations from external anthropogenic changes in the composition will also greatly improve microbial source tracking performances Read et al., 2015).
In the present study, six Swedish raw water sources were sampled for 3 years. The water sources represent diverse environments including both anthropogenically affected and more undisturbed waters. The study aimed to describe the composition and inferred metabolic capabilities of the microbial communities across a timescale of several years to assess if factors linked to anthropogenic perturbation may shape the diversity, interactions, and capabilities of the present microbes and if a difference in these patterns between affected and pristine waters could be observed.  The sequences were demultiplexed using the deML software (Renaud et al., 2015). The reads were quality filtered and denoised using DADA2 (Callahan et al., 2016(Callahan et al., , 2017. The "filterAndTrim," "learnErrors," and "dada" functions were run on each of the eight amplicon sequencing data sets to train the error model specifically for each set so that heterogeneity in sequencing runs was accounted for. Read length truncation parameters were decided based on the Phred quality scores plot for each sequencing run and varied between 140 and 180 for forward reads and between 120 and 135 for reverse reads. The maxEE parameter was set to default (i.e., equal to 2.0).
Reads were truncated at the first instance of a quality score less than or equal to 11 (i.e., truncQ  (Table A3).

| Statistical analysis and visualization
The clustering of water samples and sites into two partitions, corresponding to anthropogenically disturbed and pristine environ-  (Yu et al., 2017) in R. Heatmaps were visualized using the ggheatmap function in the R package heatmaply (Galili et al., 2018). Detection of a phylogenetic signal for metacommunity distribution was performed with the delta-statistic method presented by Borges et al. (2019), using default settings of the delta function in R and 1000 permutations to create the distribution of delta under the null hypothesis of no signal between the trait and the phylogeny.

| Multivariate generalized linear modeling of interactions between taxa
To investigate the effect of environmental predictors on the communities and biotic interactions within, a multivariate generalized linear latent variable model (GLLVM) was fitted to the community data through the gllvm R package (Niku et al., 2019).
By modeling the response of abundance to predictors jointly with the correlation across taxa, we have the possibility of teasing the two apart by explicitly modeling the correlation structure via latent variables. In doing so, we can both estimate the effects of the environmental predictors and residual correlations jointly (Caradima et al., 2019;Ovaskainen et al., 2017;Warton et al., 2015). The ASV abundance matrix was set as a response variable and rescaled and centered average air temperature, turbidity, COD-Mn, and Color values were set as continuous predictors and water plant location and season of sampling were used as group-level predictors according to:

| RESULTS
To perform an in-depth study of the difference in water properties at the selected sites, representing a wide diversity in the catchment area and land use (Table 1 and Figure 1a To summarize the general diversity analysis of the raw water assemblages, we have identified a complex pattern of variation between sites, seasons, water quality parameters, and sequence taxonomy. This observed complexity implies that further modeling is warranted to disentangle the different sources of variation.

| Model analysis reveals three distinct and abundant metacommunities
To determine if bacterial assemblages responded to habitat characteristics and displayed signs of interactions among ASVs, a GLLVM was fitted to the ASV occurrence data, with turbidity, color value, air temperature, and COD-Mn included as covariates and location and season as factor level predictors in the model. Only the top 200 ASVs in terms of total abundance were included, with a high degree of ASVs being shared (i.e., present in at least 90% of samples and with an abundance of at least 0.1% of reads) between locations ( Figure A5). The selection of the most common taxa and the exclusion of comparably more rare taxa were performed as these likely represent biologically important species and are less sensitive to both, noise in the data (i.e., close to detection limits, PCR induced bias, database incompleteness) and problems with compositionality inherent to all sequencing analyses with technical limits to sequencing depth (Alteio et al., 2021). All inferred associations with 95% confidence intervals are provided in Table S1 available at 10.5281/ zenodo.7066483. When including all predictors in the model, 41% of the total variation was accounted for in the analysis, which better allows us to draw conclusions from the inferred ASV correlations after adjusting for the predictors ( Figure A6). By inspecting the inferred factor loadings of the model, the ASVs that explain most variation in ASV abundance were assigned to the phyla Proteobacteria and Bacteroidetes ( Figure A7). A variable importance analysis was performed where the geographic location effect was found to be most influential (of the included predictors) on composition (Table A1). Bacteroidetes representing a significant fraction of metacommunity 1 but mostly present in lower amounts in metacommunity 2.
Metacommunity 3 was almost exclusively composed of Bacteroidetes and Proteobacteria, in contrast to the other two communities, and showed a lower diversity at the phylum level. In contrast to community 1, which mostly mirrored the general diversity in the data set, communities 2 and 3 comprised taxa associated with specific conditions and consisted of mutually exclusionary and unique organisms. Worth noting is that members from the same metacommunity tended to cluster together in the phylogeny. This clustering distribution of metacommunities in the phylogeny was tested using delta-statistic, with an obtained δ m = 3.28 compared to δ 0 = 2.12 (0.63) of the null distribution (p = 0.000), which implies the presence of a phylogenetic signal (i.e., the ecological similarity between taxa is related to phylogenetic relatedness), showing that specific life-strategies adopted by taxa is governed to an extent by shared evolutionary history although this can vary between taxa even if comparing on the same taxonomic level. The spatiotemporal trends of the three metacommunities were accessed by investigating their relative contribution to each assemblage ( Figure A9). On average, metacommunity 1 was the most abundant.
Both, temporal dependency on metacommunity compositions and side effects were observed, where metacommunity 2 was most abundant (on average) in Östersund and Borås (average frequencies of 21.2% and 15.0%, respectively) while metacommunity 3 was most abundant in Motala and Trollhättan (24.5% and 16.0%, respectively).
In the GLLVM analysis, associations between the taxa and environmental, seasonal, geographic, and water quality variables were inferred: all significant associations are highlighted in  F I G U R E 5 PCoA plot of differential abundance values for (a) BioCyc metabolic pathways and (b) KEGG orthologues of the three metacommunities. Color coding of data points is according to results from model analysis designation to the three metacommunities where red, green, and blue correspond to metacommunities 1, 2, and 3, respectively. Ellipses correspond to 95% confidence intervals. KEGG, Kyoto Encyclopedia of Genes and Genomes; PCoA, principal coordinate analysis.

| A stable set of core taxa
The different sampling locations presented here represent a variety of trophic conditions and thus a diverse mixture of environments ranging from affected by anthropogenic action (i.e., a greater extent of farm and urban land use) to much less exposed areas. While the sampling locations fall on a gradient, they were divided into two groups for ease of interpretation, here called A (for anthropogenically affected) and P (pristine) as proposed by Numberger et al. (2020).

| Noncorrelation of physical and chemical water parameters with microbial community composition
Interestingly, the measured water quality/meteorological variables included in the GLLVM analysis (here CODMn, turbidity, air temperature, and color value) explained a relatively small portion of the total variation in community composition. Thus, the geographical and/or water quality differences at the sampling locations influenced the community composition more heavily in our study area. Spatial

| Effect of anthropogenic impact
We found evidence of taxa augmentation, manifested as approximately higher levels of alpha diversity in water sites in higher anthropogenic affected regions with higher levels of fecal indicator bacteria (Group A, Borås, Trollhättan, and Stockholm) than in those in less-urbanized regions (Group P, Östersund, Härnösand, and Motala, see also Hägglund et al., 2018). In addition, we found differences between group A and P in relative abundances among common The three distinct metacommunities partitioned from the top 200 abundant ASV were supported by two independent analyses using two data sources: that is, the co-occurrence data based solely on abundance and the predicted metabolic pathways of the bacterial communities, based on reconstructed metabolism. In other words, we found that the predicted functionality of the bacterial freshwater metacommunity resembles the inferred correlation pattern among ASVs after adjusting for design and water quality parameters.
To assess if members of the metacommunities were associated with water quality turbidity was used as a proxy. The connection between turbidity and other water quality indicators has been well studied, and significant correlations with pathogens Giardia spp. and Of note is that metacommunity 2, shown to be positively correlated with turbidity, displayed several orthologues present in pathways involved in methane metabolism at a high likelihood of increased abundance. As methanogenic and methanotrophic processes are associated with anaerobic metabolism predominantly, this may point to the possibility of anoxic micro-environments present in particle-associated niches common in waters with high turbidity, or direct association of methane-producing taxa with photoautotrophic species by direct transfer of substrates (Grossart et al., 2011).
Metacommunity 3 showed a higher abundance of orthologues associated with lipid metabolism (although this was not pronounced) as well as the highest abundance of carbohydrate metabolism-related orthologues among the three metacommunities, potentially linking this metacommunity with nutrient-rich waters, supported by the association of this metacommunity with elevated levels of COD-Mn (chemical oxygen demand).
In conclusion, the findings presented here show that bacterial communities at six Swedish raw drinking water sources are subjected to selective pressure from environmental and land use conditions. Anthropogenic perturbation results in an effect on the structure and composition of the bacterial communities although the microbial taxa constituting the anthropogenic signature differ.
Across this gradient, the communities were structured into three metacommunities which were present at all locations across the study period, albeit at different frequencies, and consisted of typical freshwater families such as Burkholderiaceae, Flavobacteriaceae, Commamonadeceae, and Pseudomonadaceae. Bacterial lineages within metacommunities showed strong correlation and, thus, preference for occupying the same ecological niches.
Between metacommunities, lineages correlated negatively. By predicting metabolic functions of the communities, the same metacommunity structure was recovered, supporting this finding.
An important goal for future research is to study competing and coexisting bacterial lineages to better understand their role when aquatic systems are impacted by anthropogenic stress. writingreview & editing (equal).

ACKNOWLEDGMENTS
We wish to thank Moa Hägglund and Emmy Borgmästars for their technical support. This project was funded by the Swedish Civil Contingencies Agency (project B40624).

CONFLICT OF INTEREST
None declared.

DATA AVAILABILITY STATEMENT
All sequencing data generated in Hägglund et al. (2018) is available at Short Read Archive, accession SRP159537: https://www.ncbi.nlm.

ETHICS STATEMENT
None required.

GLLVM analysis
This model, Equation (1) in the main article, allowed us to estimate correlations between ASVs while simultaneously accounting for predictor variables. Three latent variables were used in the analysis, which is believed to capture most of the relevant variation in composition: see example in Warton et al. (2015) and Niku et al. (2019). To improve convergence, jittering of latent variables was set to 0.5 (jitter.var = 0.5). The negative binomial distribution was selected to model the count data. For further details of the modeling, please consult Niku et al. (2017Niku et al. ( , 2019. To visualize the inferred regression parameters, we made use of the ggplot2 R package.
To reduce the dimension of the co-occurrence data, the top 200 ASVs were selected to be included in the analysis to avoid spurious correlations with low-abundant features close to the noise level. In the analysis, we first checked whether including predictors (i.e., full model as in Equation 1 in the main paper) improved the results in terms of accounting for a large proportion of the total variation in ASV abundance as compared to a model without the predictors (i.e., null model). By including all predictors in the model, 41% of the total variation was accounted for in the analysis, which better allows us to draw conclusions from the inferred ASV correlations after adjusting for the predictors ( Figure A3). For example, little or no side effects remain after including the corresponding site predictor in the model (i.e., no visible pattern in communities shown in Figure A6a), contrary to what was evident in the null model and the standard ordination results ( Figure A6b and Figure).
To check for collinearity, we calculated the pair-wise Pearson correlations between predictors ( Figure A11). One out of six estimated pair-wise correlations was large (r = 0.87) suggesting a potential problem of collinearity for CODMn and Color value. All other obtained estimates were between −0.1 and 0.3. Thus, they cannot independently predict the value of the dependent variable: they explain partly the same variance in the dependent variable.
Because of this, caution is needed when interpreting individual associations of CODMn and color value with ASVs.

GLLVM analysis
In the GLLVM analysis, associations were also determined between the bacterial assemblages and environmental, design, and Intriguingly two ASVs, both methanotrophs assigned to metacommunity 1, showed the largest overrepresentation in abundance across all seasons. In addition, one ASV assigned to the genus Cerasicoccus (phylum Verrucomicrobia) was overrepresented in both, the summer and autumn seasons. Furthermore, one ASV assigned to metacommunity 3 belonging to the genus Acinetobacter was also overrepresented in abundance across the seasons compared to spring.

(a) (b)
F I G U R E A2 Diversity plots highlighting differences in alpha diversity, grouped according to locale (a) and season (b). Brackets show significant differences in groups as determined by the analysis of variance test with * signifying p = 0.05, **p = 0.01, and ***p = 0.001. Before analysis sequences corresponding to mitochondria and chloroplasts were removed, otherwise unfiltered (but rarefied to even depth) data was used.
Investigations of site effects on ASVs revealed that generally, ASVs assigned to metacommunities 1 and 2 showed a higher number of associations than ASVs assigned to metacommunity 3, further reinforcing that metacommunity three members consist of generalist bacteria (Figure 5b)    F I G U R E A11 The estimated coefficient of small, but significant, effects corresponding ASVs associated with: (a) site effects where Stockholm DWTP is set as a reference and (b) season effects where spring is set as a reference level. The estimated mean value is shown as a point with 95% CI as lines around the point. The colors of the coefficients, red, green, and blue correspond to metacommunities 1, 2, and 3, respectively. ASV, amplicon sequence variant; CI, confidence interval; DWTP, drinking water treatment plant. Note: Higher scores indicate less good model performance. The NULL model did not include any predictors.
Abbreviations: AIC, Aikake information criterion score; BIC, Bayesian information criterion score; df, model degrees of freedom.
T A B L E A2 The top significant associations of the 200 included taxa and predictors in the GLLVM analysis results